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Abstract 

Motivated by the need, in some Bayesian likelihood free inference 
problems, of imputing a multivariate counting distribution based on its 
vector of means and variance-covariance matrix, we define a generic 
multivariate discrete distribution. Based on blending the Binomial, 
Poisson and Negative-Binomial distributions, and using a normal mul- 
tivariate copula, the required distribution is defined. This distribution 
tends to the Multivariate Normal for large counts and has an approx- 
imate pmf version that is quite simple to evaluate. 
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1 Introduction 

We develop a generic discrete multivariate distribution denned in terms of 
its mean and covariate matrix only. The multivariate Normal distribution 
is denned in such terms and would be the default option as an inputed dis- 
tributions when only the mean and covariance matrix are available for an 
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otherwise unknown distribution. However, there is no alternative when con- 
sidering discrete data, specially in the case of low counts where a Normal 
approximation is not feasible. The motivation to defining such discrete dis- 
tribution is as follows. 

The development and analysis of mathematical epidemic models that take 



into account uncertainty is an active field of research 
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tance of this field of research is apparent given its potential impact on public 
health policies to handle emergent and re-emergent infectious diseases such 
as dengue fever, Lyme disease, tuberculosis, flu, etc. It is known that the 
effects of local (demographic) stochasticity weight more in determining the 
dynamics of epidemics when the number of individuals in the population is 
low. On the other hand, parameter estimation is among the standard tools 
to explore the predictive capacity of models from partial observation of the 
state variables. In this context, it is specially important to devise methods 
to study the predictive capacity of the mathematical models, in particular, 
to quantify the uncertainties. 

For the sake of clarity we use the simplest epidemic model, e.g. the SIR 
model without vital dynamics. Let the random variables X\, X 2 and A 3 
denote the number of Susceptible, Infected and Recovered individuals in a 
closed population at a given time t, respectively. The stochastic model is 
defined by the processes through which it evolves: 



Xl + X 2 — 7 2^2 
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If we denote by Xi, x 2 and x 3 a realization of the random variables X\, X 2 and 
A 3 , and let P%{t) = P(xi,x 2 ,x 3 ,t) be the probability that the system is in 



state (xi, x 2 , x 3 ) at time t, then the chemical master equation (van Kampen 



1992) for this system is given by 
dP(x 1 ,x 2 ,x 3 ,t) 



dt 



{x x + 1) 
b —^—(x 2 



l)P(Xi + l,x 2 - l,x 3 ,t) 



+ b x [x 2 + l)P(x 1 + l,x 2 - 1, x 3 - 1, t) 
- ( b o~Q X i + hx 2 )P{x 1 ,x 2 , x 3 , t), 
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where the constant f2 denotes the total number of individuals, and b and b\ 
are respectively the contact rate and the rate of loss of infectiousness. 

Applying the Inverse Size Expansion (van Kampen, 1992) to equation [T] 
leads to equations for the expected value and the fluctuations of X\, x 2 and 
x 3 . The Fokker-Plank approximation is then used leading to an approx- 
imation for the mean E t (Xi) and cross products E t (XiXj) for all species 
i,j = 1, 2, ... ,k at each time t Chen and Bokka ( 2005[ ). As far as the dis- 
tribution for the X/s is concern we know to be dealing with counting data 
Xi 6 N; % — 1, 2, . . . , k and, for a fixed time t, we have (an approximation for) 
their means //j = E t (Xi), variances = E t (Xf) — fi? and their correlations 

_ Et(XiXj)-/iiiij 

It is possible to simulate directly from the true model P(-,t) above (for 
fixed 60 and 61) to simulate realizations of (Xi(t), X 2 (t), X 3 (t)) ( Gillespie] 



2007). Data is commonly only available for X 2 (t) and for specific epidemics 
(eg. Dengue fiver) there is substantial prior expert information for the con- 
tact rate, b , and the rate of loss of infectiousness, b\. It would be possible 
to use the ABC algorithm (Marin et al. , 2011) to make Bayesian inferences 
about 60 and 61 but the simulation procedure becomes very slow for mod- 
erate population sizes Vt and still the ABC approach lacks a formal theo- 
retical foundation (Marin et al.| |2011 , p. 4). Instead, using the moment 



approximation explained above (that indeed depends on the parameters of 
interest b and 61), we impute a counting (discrete) distribution on the ob- 
servables (commonly X 2 {t) but in some situations all Xi(t), X 2 (t), X 3 (t) are 



observed unknonw, 1978) matching those moments to create a likelihood. 



The computational complexity of this likelihood is in fact independent of Q 
and, using elicited priors and MCMC, a Bayesian inference is possible for any 
set of population sizes. We have had already promising results along these 
lines and will publish such research in an specialized journal of the field. 

We will assume that the correlation matrix p = (pij) is positively defined 
and, certainly, /ij, Vi > 0. Based on this information only, we need to impute 
a discrete distribution for the observables (Xi, X 2 ,X 3 ) that would be defined 
by these moments. Here we propose such generic multivariate distribution 
for counting data. In the next section we explain the univariate version, 
which is a simple combination of the default distributions commonly used for 
counting data, namely, the Binomial, the Poisson and the Negative Binomial. 
In Section [3] we create the multivariate version using a Normal copula and 
in Section [4] present some examples. 
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2 A Univariate Generic Discrete Distribution 

Gd(m, v) 

The Poisson, Binomial and Negative Binomial distributions are simple form 
distributions and first candidates for counting data models. For any mean 
and variance fi, v > we make a combination of these three distributions in 
the following way 

{C™ p x (l — p) m ~ x ; x — 0,l,...m if fi > v 
e~ v ^; xeN if n = v (2) 

Qx+m-i pc(i _ p y m . xefq ii i_i<v 

where p — 1 — min ( — ,—), m = r-^— r and C™ are the combinations of m 
items taken in subsets of size x. That is, we use a Binomial if /i > t> , a Poisson 
if // = v and a Negative-Binomial if fi < v . Neither of these distributions can 
handle any mean and variance; by combining these distributions we obtain 
the Generic Discrete class Gd(/i, v) defined for arbitrary mean \i and variance 
v, /i, v > 0, and these two moments completely define the distribution. 

Indeed, it is straightforward to see that if X ~ Gd(fi,v), E(X) = fi and 
V(X) = v. More importantly, for a fixed mean //, given both the properties 
of the Binomial and the Negative-Binomial, we see that 

_ v x 

lim Gd(x kt, v) — e v — . 
v->n " x\ 

Therefore we have a continuous evolution of this parametric class, being the 
Poisson the "continuous bridge" between the Binomial (fi > v) and Negative 
Binomial (fi < v). (Note that if fi > v and v — >■ fi, the support will increase 
to cover all N since m — > oo.) 

Moreover, if X ~ Gd(fi,v), will tend to a standard Normal distri- 
bution if fi — > oo and p — >■ p G (0, 1). That is, for large fi (and for example 
fi — 3y/v > 0) Gd(fi,v) can be approximated with a N(fi,v). Practical 
guidelines for approximating the Poisson, Binomial and Negative-Binomial 
distributions should be used when calculating the cdf, pmf etc. of Gd(fi,v). 
We then see that the Gd(fi, v) family evolves to a Normal distribution when 
fi is large (large counts). 
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3 The Multivariate Case 



Suppose we have k = 2 discrete distributions and we let n = (pi, p 2 )' and 
v = (vi, v 2 )' be their vector of means and variances and p = pi >2 . We require 
a bivariate discrete distribution, defined in terms of pL, v and p, such that the 
marginal distribution for each X{ is Gd(pi,Vi) and the resulting correlation 
between X\ and X 2 is (at least approximately) p. 

We use a Normal Copula (see Nelsen, 2006, chap. 2) to create a a joint 
distribution. Let 



(f) p {s,t) 



1 



2W1 



exp 



[s 2 - 2pst + t 2 
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be the bivariate standard normal distribution with correlation p and $(s) 
the standard normal cdf. The normal Copula is defined as 



C p (u,v) 



[s, t)dsdt. 



We define the joint cdf of X\ and X 2 as 

fa)), 

where F^ v is the cdf of a Gd(p, v). F^^ ^xi, x 2 ) defines the Generic Discrete 
distribution in dimension 2, Gd 2 (/J,,v, p), and it is straightforward to verify 
that its pmf is 

Gd 2 (xi,X2\fJ,,v,p) = C p (F^ uVl (xi),F^ )V2 (x 2 )) - C p (F lil)Vl (xi-l),F IMl)V2 (x2)) - (3) 

(x 2 - 1)) + C p {F fluVl (x 1 - 1) 

fa - 1)) 

(that is, the pmf are the corresponding jumps in the stepped cdf F ll ^ vp {x\,x 2 )). 
Since C p (u,v) is a Copula for every — 1 < p < 1, the marginal distributions 
will be precisely Gd(p\,Vi) and Gd(p 2 ,v 2 ), as required (Nelsen, 2006). 

By pretending Fp uVi (xi) is differentiable, with derivative Gd(xi\pi,Vi), 
'differentiating' F Mjt , jP (xi, x 2 ) suggest the following approximation to the pmf 

Gd 2 (xi,x 2 \fi,v, p) pa gd 2 (xx,x 2 \fi,v, p) = K f p Gd{x\\pi, v\)Gd{x 2 \p 2 , v 2 ) 



<j>(s)c/>(t) 



(4) 
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were s = t = $ _1 (F M2 ^ 2 (x 2 )) and </>(•) is the standard Nor- 

mal pdf, for some normalization constant K. This approximation will prove 
useful even for p^ small, and is far less computationally demanding than the 
exact version in ([3]). 

If F^ uVi {xi) were cdf's of N(pii,Vi), that is F^ itVi (xi) = $ (^f 1 ), we im- 
mediately see that the correlation between X\ and X 2 is p. In the general 
case as each marginal distribution Gd{p il Vi) becomes similar to a Normal 
distribution the actual correlation will be approximately p and Gd 2 becomes 
increasingly similar to a Bivariate Normal distribution with the correct mo- 
ments. 

Finally, the multivariate generic discrete distribution is defined as 
Gd n (x\p,v,p) = / ••• / (27r)^/ 2 |p|- 1 / 2 exp { --s' p-'s [ ds, ■ ■ ■ ds n . 

The corresponding approximate pmf would be 

gd n (x\n,v, p) = K ^ P ^ l ,\ S2 l ' ' ' , -^ L ^ Gd(x 1 \/ii, v 1 )Gd(x 2 \p2, v 2 ) ■ ■ ■ Gd(x n \p n , v n ), 

0(Sl)0(s 2 ) ■ ■■<p{s n ) 

with Si = (&*))• 



4 Example 

We present four examples of Gd 2 , see Figure [T] and Table [1} We compare the 
approximation with the exact distributions by calculating numerically the 
exact pmf in ^ and the approximate pmf, gd 2 , in ^ over a relevant grid 
of the support for each example. The approximation seems to be very good 
option and far less computationally demanding. Moreover, the moments 
match correctly and both Gd 2 and gd 2 have an actual correlation quite near 
the required one (compare p with p' and p* in Figure [I]). It is also very 
remarkable that the normalization constant needed for the approximation is 
quite close to 1. This will potentially enable the use of gd n as an alternative, 
less computationally demanding likelihood, by considering K to depend only 
marginally on the mean and variance-covariance matrix. 
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Pi 




M2 


V2 


P 


P' 


K 


a*; 




lA 


vl 


P* 


(a) 


50 


25 


50 


25 


0.5 


0.5144 


0.99 


50.25 


24.99 


50.25 


24.99 


0.5009 


(b) 


50 


25 


60 


25 


0.7 


0.7129 


0.99 


50.35 


25.04 


59.85 


24.76 


0.7013 


(c) 


10 


25 


11 


4 


0.4 


0.3988 


0.99 


9.45 


23.30 


10.90 


3.85 


0.3876 


(d) 


10 


5 


5 


10 


0.7 


0.6869 


0.97 


10.23 


4.68 


5.43 


10.43 


0.6670 



Table 1: Gd,2, resulting correlation, p', for the exact and resulting moments 
for the approximate (*) pmf of Gd,2, gd-2, with various parameters. Contour 
plots for the exact and approximate pmf 's of distribution (a)-(d) may be seen 
in Figure [Tj The normalization constant needed for g>cf 2 is given in column 
K. 



5 Discussion 

We develop a generic discrete multivariate distribution defined in terms of its 
vector of means and variance-covariance matrix only, as it is the case for the 
Multivariate-Normal distribution for continuous data. This distribution has 
applications in the Bayesian analysis of complex models were we are dealing 
with counting data and the correct likelihood is not available analytically, 
but approximation techniques can be developed to obtain moments of ob- 
servables. This is the case when studying epidemics using the SIR stochastic 
model, as explained in Section [T] The distribution developed here can now 
be used as a default distribution to be imputed to multivariate counting data 
in such situations. Moreover, when large counts are involved this distribution 
tends to a Multivariate Normal, (eg. Figure [ija) and (b)). 
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